Dealing with unknown words in statistical machine translation
Abstract
In statistical machine translation, words that were not seen during training are unknown words, that is, words the system does not know how to translate. In this paper we address this problem by exploiting the orthographic cues that words provide. We report a study of the impact of word distance metrics on cognate detection and, in addition, of the possibility of obtaining candidate translations of unknown words through Logical Analogy. Our approach is tested on the translation of corpora from Portuguese to English (and vice versa).
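To make the idea of a word distance metric for cognate detection concrete, here is a minimal sketch. The abstract does not say which metrics were studied, so the choice of normalized Levenshtein distance, the function names, and the 0.7 threshold are all illustrative assumptions, not the authors' method.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Map edit distance to [0, 1]; 1.0 means identical spelling."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def is_cognate_candidate(src: str, tgt: str, threshold: float = 0.7) -> bool:
    """Hypothetical decision rule: the threshold value is an assumption."""
    return similarity(src.lower(), tgt.lower()) >= threshold


if __name__ == "__main__":
    # Portuguese/English pairs chosen only for illustration.
    for pt, en in [("nacional", "national"), ("cão", "dog")]:
        print(pt, en, round(similarity(pt, en), 3), is_cognate_candidate(pt, en))
```

Under these assumptions, a pair such as "nacional"/"national" differs by one substitution and is accepted, while an orthographically unrelated pair such as "cão"/"dog" is rejected. A real system would tune the threshold on held-out data and would probably compare several metrics.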
Similar resources
Overcoming Vocabulary Sparsity in MT Using Lattices
Source languages with complex word-formation rules present a challenge for statistical machine translation (SMT). In this paper, we take on three facets of this challenge: (1) common stems are fragmented into many different forms in training data, (2) rare and unknown words are frequent in test data, and (3) spelling variation creates additional sparseness problems. We present a novel, lightweig...
A new model for persian multi-part words edition based on statistical machine translation
Multi-part words in English are hyphenated, with the hyphen separating the different parts. Persian has multi-part words as well. In Persian morphology, a half-space character is needed to separate the parts of multi-part words, but in many cases people incorrectly use a space character instead. This common incorrect use of the space leads to some s...
Statistical Machine Translation for Twitter
We consider the problem of translating short messages (Tweets) using Europarl as a starting-point. After highlighting some of the domain differences between Europarl and Twitter, we show that for German-English translation, we can improve performance from a baseline BLEU score of 25.58 to 53.45. By far and away the single most important improvement is passing-through unknown words (which are ma...
Handling Unknown Words in Statistical Machine Translation from a New Perspective
Unknown words are one of the key factors that drastically affect translation quality. Traditionally, nearly all related research focuses on obtaining translations of the unknown words in different ways. In this paper, we propose a new perspective on handling unknown words in statistical machine translation. Instead of making a great effort to find the translation of unknown words, th...
Analogical translation of unknown words in a statistical machine translation framework
In this paper we address the problem of translating unknown words in a statistical machine translation framework. In data-driven machine translation, words that are not seen in the data may not be translated and are either discarded or left as is in the output. They are referred to as unknown words. The unknown word problem increases when the available bilingual data is scarce. In order to addre...
Publication date: 2012